{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Decision tree for regression\n",
    "\n",
    "In this notebook, we present how decision trees are working in regression\n",
    "problems. We show differences with the decision trees previously presented in\n",
    "a classification setting.\n",
    "\n",
    "First, we load the penguins dataset specifically for solving a regression\n",
    "problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_name = \"Body Mass (g)\"\n",
    "data_train, target_train = penguins[[feature_name]], penguins[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate how decision trees predict in a regression setting, we create a\n",
    "synthetic dataset containing some of the possible flipper length values\n",
    "between the minimum and the maximum of the original data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "data_test = pd.DataFrame(\n",
    "    np.arange(data_train[feature_name].min(), data_train[feature_name].max()),\n",
    "    columns=[feature_name],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the term \"test\" here refers to data that was not used for training. It\n",
    "should not be confused with data coming from a train-test split, as it was\n",
    "generated in equally-spaced intervals for the visual evaluation of the\n",
    "predictions.\n",
    "\n",
    "Note that this is methodologically valid here because our objective is to get\n",
    "some intuitive understanding on the shape of the decision function of the\n",
    "learned decision trees.\n",
    "\n",
    "However, computing an evaluation metric on such a synthetic test set would be\n",
    "meaningless since the synthetic dataset does not follow the same distribution\n",
    "as the real world data on which the model would be deployed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "_ = plt.title(\"Illustration of the regression dataset used\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first illustrate the difference between a linear model and a decision\n",
    "tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "linear_model = LinearRegression()\n",
    "linear_model.fit(data_train, target_train)\n",
    "target_predicted = linear_model.predict(data_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(data_test[feature_name], target_predicted, label=\"Linear regression\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction function using a LinearRegression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the plot above, we see that a non-regularized `LinearRegression` is able to\n",
    "fit the data. A feature of this model is that all new predictions will be on\n",
    "the line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(\n",
    "    data_test[feature_name],\n",
    "    target_predicted,\n",
    "    label=\"Linear regression\",\n",
    "    linestyle=\"--\",\n",
    ")\n",
    "plt.scatter(\n",
    "    data_test[::3],\n",
    "    target_predicted[::3],\n",
    "    label=\"Predictions\",\n",
    "    color=\"tab:orange\",\n",
    ")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction function using a LinearRegression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contrary to linear models, decision trees are non-parametric models: they do\n",
    "not make assumptions about the way data is distributed. This affects the\n",
    "prediction scheme. Repeating the above experiment highlights the differences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree = DecisionTreeRegressor(max_depth=1)\n",
    "tree.fit(data_train, target_train)\n",
    "target_predicted = tree.predict(data_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(data_test[feature_name], target_predicted, label=\"Decision tree\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction function using a DecisionTreeRegressor\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the decision tree model does not have an *a priori* distribution\n",
    "for the data and we do not end-up with a straight line to regress flipper\n",
    "length and body mass.\n",
    "\n",
    "Instead, we observe that the predictions of the tree are piecewise constant.\n",
    "Indeed, our feature space was split into two partitions. Let's check the tree\n",
    "structure to see what was the threshold found during the training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import plot_tree\n",
    "\n",
    "_, ax = plt.subplots(figsize=(8, 6))\n",
    "_ = plot_tree(tree, feature_names=[feature_name], ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The threshold for our feature (flipper length) is 206.5 mm. The predicted\n",
    "values on each side of the split are two constants: 3698.71 g and 5032.36 g.\n",
    "These values corresponds to the mean values of the training samples in each\n",
    "partition.\n",
    "\n",
    "In classification, we saw that increasing the depth of the tree allowed us to\n",
    "get more complex decision boundaries. Let's check the effect of increasing the\n",
    "depth in a regression setting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree = DecisionTreeRegressor(max_depth=3)\n",
    "tree.fit(data_train, target_train)\n",
    "target_predicted = tree.predict(data_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(data_test[feature_name], target_predicted, label=\"Decision tree\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction function using a DecisionTreeRegressor\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Increasing the depth of the tree increases the number of partitions and thus\n",
    "the number of constant values that the tree is capable of predicting.\n",
    "\n",
    "In this notebook, we highlighted the differences in behavior of a decision\n",
    "tree used in a classification problem in contrast to a regression problem."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}